Introduction to Open Data Science - Course Project

About the project


# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Dec 11 16:31:06 2023"

The course started quite well and smoothly. I liked the introduction by Professor Kimmo, and overall everything looks well scheduled and organised. From the course I expect to improve my bioinformatics and data science knowledge. I already have quite a bit of R experience, but I'm sure I will improve that as well with this course: I really like programming, and R is quite a cool language. For sure, it shines when used with RStudio. I actually didn't know about using Git for collaboration!
I heard about this course from a colleague of mine.

Here’s the link to my GitHub repository.


Regression and model validation


date()
## [1] "Mon Dec 11 16:31:06 2023"
# load the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

Description of the data

# read students2014 dataset after manipulation (from data wrangling)
data <- read.csv("analysis.csv", header = T, row.names = 1)

# explore the dataset
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(data)
## [1] 166   7

The dataset consists of 166 rows and 7 columns.
It is a subset of a larger dataset reporting the results of a student survey. Each observation (i.e. row) is a student, and the variables (i.e. columns) describe characteristics of the students. The variables are:

  • gender, categorical (M or F), reporting the gender of the student.
  • Age, numeric, reporting the age of the student.
  • Attitude, numeric, reporting the sum of 10 questions related to students' attitude towards statistics, each measured on the Likert scale (1-5).
  • deep, numeric, a combination variable of 12 deep learning questions.
  • stra, numeric, a combination variable of 8 strategic learning questions.
  • surf, numeric, a combination variable of 12 “surface” questions.
  • Points, numeric, reporting the exam points.

Visualization of the data

# visualize all of the variables and relationships in a single plot
ggpairs(data, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))

From the visualization we can see that the distributions of the variables do not differ much when gender is taken into account. Only the age variable has more outliers for females than for males.
Two variable pairs show a significant correlation: positive for attitude and points, and negative for surf and deep. Interestingly, in the latter case (surf vs deep) the negative correlation is significant only for males.

Multiple regression

# fit a model where points is the outcome variable and attitude, stra and surf are
# the explanatory variables
fit <- lm(Points ~ Attitude + stra + surf, data = data)
# summary of fitted model
summary(fit)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## Attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

As the three explanatory variables I selected attitude, stra and surf, since they show the highest absolute correlations with the target variable.
From the summary we can get some information about the model:

  • Estimate contains the intercept (β0) and the coefficient estimates (β1, β2, …) associated with each predictor variable.
  • Std. Error is the standard error of each coefficient estimate, measuring the precision of the estimate.
  • t value is the t-statistic (the estimate divided by its standard error).
  • Pr(>|t|) is the p-value, showing the significance of each predictor's relationship with the target variable.

Looking at the individual explanatory variables, only attitude, with a p-value of ~1.9e-08, seems to contribute substantially to the model. The other variables do not show significant p-values, so I remove them from the model one at a time.
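The selection criterion above can be double-checked with a quick correlation table (a sketch, assuming the same `data` frame loaded earlier):

```r
# correlation of each numeric variable with the exam points:
# attitude, stra and surf show the largest absolute values
num_vars <- sapply(data, is.numeric)
round(cor(data[, num_vars])["Points", ], 2)
```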

# fit a model where points is the outcome variable and attitude and stra are the
# explanatory variables
fit_2 <- lm(Points ~ Attitude + stra, data = data)
# summary of fitted model
summary(fit_2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## Attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# fit a model where points is the outcome variable and attitude is the explanatory
# variable.
fit_1 <- lm(Points ~ Attitude, data = data)
# summary of fitted model
summary(fit_1)
## 
## Call:
## lm(formula = Points ~ Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Interpretation of multiple regression

# summary of fitted model with only attitude as explanatory variable
summary(fit_1)
## 
## Call:
## lm(formula = Points ~ Attitude, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## Attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# plot the two variables
ggplot(data = data, aes(x = Attitude, y = Points)) +
    geom_point() +
    geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

The explanatory variable attitude is significantly associated with the target variable points (p-value = 4.12e-09). Considering the formula \(y_i = \beta_0 + \beta_1 x_i\), from the summary we can see that our model has \(\beta_0\) = 11.6372 and \(\beta_1\) = 3.5255. This means that for each +1 in the explanatory variable, x, the target variable, y, increases by 3.5255.

The fitted model has a Multiple R-squared of 0.1906, meaning that ~19% of the variance of the target variable is explained by the explanatory variable.
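To make that interpretation concrete, the fitted model `fit_1` can be used to predict exam points for a few attitude values (the values 2, 3 and 4 are purely illustrative):

```r
# each +1 in attitude raises the predicted points by ~3.53
predict(fit_1, newdata = data.frame(Attitude = c(2, 3, 4)))
```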

Diagnostic plots

# produce Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage 
# diagnostic plots.
par(mfrow = c(2,2))
plot(fit_1, which = c(1,2,5))

  • The Residuals vs Fitted plot displays the residuals against the fitted values, showing whether the residuals follow a non-linear pattern. If the values spread equally around the horizontal line without showing any pattern, non-linear relationships are not present in the data.

    • This is the case for our model: the values are equally spread (with a few outliers, though) and there is no evidence of any problem in the model.
  • The Normal QQ-plot shows whether the residuals are normally distributed: if so, the points fall along the dashed line.

    • Again, this plot does not show any issue. The values are placed quite well along the line.
  • The Residuals vs Leverage plot displays the standardized residuals against the leverage. In this plot, values should not fall outside Cook's distance (a commonly used threshold is 1).

    • No values fall outside Cook's distance, so no issues are detected.
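As an additional, optional check of residual normality (not part of the exercise, just complementary to the QQ-plot):

```r
# Shapiro-Wilk test of the residuals of fit_1; a large p-value is
# consistent with the normality assumption suggested by the QQ-plot
shapiro.test(residuals(fit_1))
```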

Logistic regression


date()
## [1] "Mon Dec 11 16:31:15 2023"
# load libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(boot)
library(tidyr)

Description of the data

# load the alc dataset
alc <- read.table("data/alc.csv", sep = '\t', header = T)
# print out the names of the variables and describe the data set briefly
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The dataset is the result of merging two identical questionnaires about secondary school students' alcohol consumption in Portugal, including student grades and demographic, social and school-related features. The two original questionnaires report the performance in two distinct subjects: mathematics and Portuguese language.
The two questionnaires were joined on the common variables (columns); the questionnaire-specific columns were combined by taking the mean for numeric data and by keeping the answers from the mathematics questionnaire for non-numeric data.
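A sketch of that merging logic, with hypothetical object names (`math` and `por` for the two questionnaires, `join_cols` for the shared identifier columns); the actual wrangling script may differ:

```r
library(dplyr)

# hypothetical sketch: join the two questionnaires on the shared columns
math_por <- inner_join(math, por, by = join_cols, suffix = c(".m", ".p"))

# combine each questionnaire-specific column: mean for numeric data,
# otherwise keep the answer from the mathematics questionnaire
for (col in setdiff(colnames(math), join_cols)) {
  m <- math_por[[paste0(col, ".m")]]
  p <- math_por[[paste0(col, ".p")]]
  math_por[[col]] <- if (is.numeric(m)) rowMeans(cbind(m, p)) else m
}
```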

Making hypothesis

# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Looking at the variables, here are four hypotheses about possible relationships with alcohol consumption:

  • sex is an important variable, since previous studies report that male teenagers/young adults show a higher alcohol consumption compared to females of the same age.
  • age is important too, since people around 18-22 years old consume more alcohol than their younger peers.
  • studytime can be associated with alcohol consumption, since students who spend more time studying are less prone to abuse alcohol.
  • finally, famsup can be associated too: students with more family educational support may be less prone to abuse alcohol.

Distributions of chosen variables

# sex
data <- data.frame(sex = c("M", "F"), 
           high_alc = c(nrow(alc[alc$sex == "M" & alc$high_use == TRUE,]),
                        nrow(alc[alc$sex == "F" & alc$high_use == TRUE,])),
           low_alc = c(nrow(alc[alc$sex == "M" & alc$high_use == FALSE,]),
                       nrow(alc[alc$sex == "F" & alc$high_use == FALSE,])))
rownames(data) <- data[,1]
data <- data[,-1]
barplot(as.matrix(data) , beside=T,col=c("blue" , "red") , ylab="")

# age
ggplot(alc, aes(x = high_use, y = age)) +
  geom_boxplot() + xlab("high alcohol consumption") + ylab("age") +
  theme_classic()

# studytime
ggplot(alc, aes(x = high_use, y = studytime)) +
  geom_boxplot() + xlab("high alcohol consumption") + ylab("studytime") +
  theme_classic()

# famsup
data <- data.frame(famsup = c("yes", "no"), 
           high_alc = c(nrow(alc[alc$famsup == "yes" & alc$high_use == TRUE,]),
                        nrow(alc[alc$famsup == "no" & alc$high_use == TRUE,])),
           low_alc = c(nrow(alc[alc$famsup == "yes" & alc$high_use == FALSE,]),
                       nrow(alc[alc$famsup == "no" & alc$high_use == FALSE,])))
rownames(data) <- data[,1]
data <- data[,-1]
barplot(as.matrix(data) , legend.text = T, beside=T,col=c("darkgreen" , "purple"),
        ylab="", main= "famsup")

In line with my hypotheses, male students seem to have a higher alcohol consumption compared to their female peers, suggesting that sex plays an important role in predicting alcohol abuse. The same goes for age: older students seem to have a higher alcohol consumption.
Interestingly, studytime looks like a very important predictor of alcohol abuse, even more so than sex.
The effect of famsup on consumption is hard to establish from the plot: an effect seems to be present, but it is not as clear as for the other predictors.

Logistic regression

fit <- glm(high_use ~ sex + age + studytime + famsup , data = alc, family = "binomial")
summary(fit)
## 
## Call:
## glm(formula = high_use ~ sex + age + studytime + famsup, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -3.8395     1.7471  -2.198  0.02797 * 
## sexM          0.7268     0.2502   2.905  0.00367 **
## age           0.2120     0.1015   2.088  0.03681 * 
## studytime    -0.5076     0.1614  -3.144  0.00167 **
## famsupyes     0.1393     0.2501   0.557  0.57747   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 420.92  on 365  degrees of freedom
## AIC: 430.92
## 
## Number of Fisher Scoring iterations: 4
OR <- coef(fit) %>% exp
# compute confidence intervals (CI)
CI <- confint(fit) %>% exp
## Waiting for profiling to be done...
CI
##                    2.5 %    97.5 %
## (Intercept) 0.0006667136 0.6396151
## sexM        1.2698828036 3.3925294
## age         1.0150125081 1.5128364
## studytime   0.4341702473 0.8188937
## famsupyes   0.7072097008 1.8883245
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR        2.5 %    97.5 %
## (Intercept) 0.02150485 0.0006667136 0.6396151
## sexM        2.06838548 1.2698828036 3.3925294
## age         1.23609527 1.0150125081 1.5128364
## studytime   0.60193318 0.4341702473 0.8188937
## famsupyes   1.14948947 0.7072097008 1.8883245

As observed in the data visualization, sex, age and studytime are important predictors of alcohol consumption. Male students are indeed significantly more prone to abuse alcohol compared to females. Age correlates too: the older the students are, the higher the alcohol consumption gets. Furthermore, as hypothesized and observed in the plots, studytime correlates inversely with alcohol abuse. famsup, on the other hand, is not associated with consumption: a minor effect is observed, but it is not statistically significant (p-value = 0.58).
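The odds ratios printed above are simply the exponentiated model coefficients; for example, for sexM (using the estimate from the summary):

```r
# odds ratio for sexM: males have roughly twice the odds of high use
exp(0.7268)
```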

Prediction

fit1 <- glm(high_use ~ sex + age + studytime, data = alc, family = "binomial")
summary(fit1)
## 
## Call:
## glm(formula = high_use ~ sex + age + studytime, family = "binomial", 
##     data = alc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -3.6554     1.7142  -2.132  0.03297 * 
## sexM          0.7072     0.2474   2.858  0.00426 **
## age           0.2054     0.1008   2.037  0.04161 * 
## studytime    -0.4972     0.1601  -3.105  0.00190 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 421.24  on 366  degrees of freedom
## AIC: 429.24
## 
## Number of Fisher Scoring iterations: 4
# predict() the probability of high_use
probabilities <- predict(fit1, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   15
##    TRUE     92   19
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2891892
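The same training error can be read directly off the 2x2 table above: the misclassified students are the 15 false positives plus the 92 false negatives, out of 370 observations in total:

```r
# misclassified / total, matching the loss function result (~0.289)
(15 + 92) / 370
```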

10-fold cross validation

# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = fit1, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2945946

The prediction error of my model using 10-fold cross-validation is ~0.29, meaning that my model performs worse than the one introduced in the Exercise Set.
A better selection of predictor variables is needed to improve my model.


Dimensionality reduction techniques


date()
## [1] "Mon Dec 11 16:31:20 2023"
# load libraries
library(tibble)
library(readr)
library(GGally)
library(FactoMineR)
library(tidyr)
library(dplyr)

Description of the data

# load the Human data
human <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# move the country names to rownames
human <- column_to_rownames(human, "Country")
# visualize the 'human' variables
ggpairs(human, progress = FALSE)

The dataset contains 8 quality-of-life variables for 155 countries.
Looking at the pairs plot, we can see that many variables correlate with each other. Seven pairs show a strong, significant positive correlation, namely Life.Exp vs Edu2.FM, Edu.Exp vs Edu2.FM, GNI vs Edu2.FM, Edu.Exp vs Life.Exp, GNI vs Life.Exp, GNI vs Edu.Exp and Ado.Birth vs Mat.Mor. Three of these seven positive correlations involve Edu2.FM, suggesting that countries where the proportion of females with at least secondary education is high relative to males also have a higher life expectancy at birth, more expected years of schooling and a higher gross national income per capita.
Conversely, 8 pairs show a strong, significant negative correlation, namely Mat.Mor vs Edu2.FM, Ado.Birth vs Edu2.FM, Mat.Mor vs Life.Exp, Ado.Birth vs Life.Exp, Mat.Mor vs Edu.Exp, Ado.Birth vs Edu.Exp, Mat.Mor vs GNI and Ado.Birth vs GNI.

PCA with non-scaled data

# perform PCA
pca_res <- prcomp(human)
# variability captured by the PCs
s <- summary(pca_res)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# biplot
biplot(pca_res, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

PCA with scaled data

# scale human data
human_scaled <- scale(human)
# perform PCA
pca_res_scaled <- prcomp(human_scaled)
# variability captured by the PCs
s_scaled <- summary(pca_res_scaled)
# biplot
biplot(pca_res_scaled, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

This second PCA looks much better. The first PCA, computed on the unscaled data frame, is not very informative: the variable GNI, whose values are much larger than those of the other variables, is the only one visible, making the other variables (and the country names) barely recognizable.
After scaling (second PCA), all the variables are on the same scale, which makes the result interpretable.
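The dominance of GNI in the unscaled PCA can be seen from the column standard deviations (assuming the same `human` data frame as above):

```r
# GNI's standard deviation is orders of magnitude larger than the rest,
# so it dominates the principal components of the unscaled data
apply(human, 2, sd)
```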

Interpretation of PCs

From the scaled PCA plot, I can see that PC1 (x axis) accounts for the variability of most of the variables: Mat.Mor and Ado.Birth on the positive side, and Edu2.FM, Edu.Exp, GNI and Life.Exp on the negative side. PC2 (y axis), on the other hand, accounts for the variability of Parli.F and Labo.FM, both on the positive side.
Focusing on PC1, it is interesting to see that the variables describing development in different aspects of society, like life expectancy, GNI or expected years of schooling, point towards the negative side, meaning that more developed countries are placed on the left of the plot (of course, positive and negative do not mean anything here; I am just noting that the “good” variables point to the left). With that in mind, we can see that both the maternal mortality ratio and the adolescent birth rate point towards the right, i.e. in the opposite direction compared to the “good” measurements mentioned before. While that is not surprising for the maternal mortality ratio, the opposite direction of the adolescent birth rate suggests that people in developed countries tend to have children later than in less developed ones.

MCA on tea dataset

# laod tea dataset
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
                stringsAsFactors = TRUE)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# perform MCA
mca <- MCA(tea, graph = F)
# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea, graph = F) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# biplot
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali" )

From the plot, we can see that the majority of the variables are spread along PC2, implying that PC2 is able to “stratify” most of the questions in the tea dataset.
On the other hand, the categories unpackaged and tea shop are clearly placed on the right (positive) side of PC1.
Drawing conclusions from this plot is not easy. What we can speculate is that PC1 stratifies the people we could identify as “tea connoisseurs”: those who get their tea at a tea shop usually know what they want and buy it unpackaged. Conversely, “casual” tea drinkers get their tea from a chain store and buy it in tea bags (both of these categories are located on the negative side of PC1).

Analysis of longitudinal data


date()
## [1] "Mon Dec 11 16:31:27 2023"
# load the GGally and ggplot2 libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

Part I

Description of the RAT data

First of all, let’s have a look at the data and give a brief description of the dataset.

# Read the RATS data
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt",
                   sep = "\t", header = T)

# Look at the (column) names of RATS
names(RATS)
##  [1] "ID"    "Group" "WD1"   "WD8"   "WD15"  "WD22"  "WD29"  "WD36"  "WD43" 
## [10] "WD44"  "WD50"  "WD57"  "WD64"
# Look at the structure of RATS
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
# Print out summaries of the variables
summary(RATS)
##        ID            Group           WD1             WD8             WD15      
##  Min.   : 1.00   Min.   :1.00   Min.   :225.0   Min.   :230.0   Min.   :230.0  
##  1st Qu.: 4.75   1st Qu.:1.00   1st Qu.:252.5   1st Qu.:255.0   1st Qu.:255.0  
##  Median : 8.50   Median :1.50   Median :340.0   Median :345.0   Median :347.5  
##  Mean   : 8.50   Mean   :1.75   Mean   :365.9   Mean   :369.1   Mean   :372.5  
##  3rd Qu.:12.25   3rd Qu.:2.25   3rd Qu.:480.0   3rd Qu.:476.2   3rd Qu.:486.2  
##  Max.   :16.00   Max.   :3.00   Max.   :555.0   Max.   :560.0   Max.   :565.0  
##       WD22            WD29            WD36            WD43      
##  Min.   :232.0   Min.   :240.0   Min.   :240.0   Min.   :243.0  
##  1st Qu.:267.2   1st Qu.:268.8   1st Qu.:267.2   1st Qu.:269.2  
##  Median :351.5   Median :356.5   Median :360.0   Median :360.0  
##  Mean   :379.2   Mean   :383.9   Mean   :387.0   Mean   :386.0  
##  3rd Qu.:492.5   3rd Qu.:497.8   3rd Qu.:504.2   3rd Qu.:501.0  
##  Max.   :580.0   Max.   :590.0   Max.   :597.0   Max.   :595.0  
##       WD44            WD50            WD57            WD64      
##  Min.   :244.0   Min.   :238.0   Min.   :247.0   Min.   :245.0  
##  1st Qu.:270.0   1st Qu.:273.8   1st Qu.:273.8   1st Qu.:278.0  
##  Median :362.0   Median :370.0   Median :373.5   Median :378.0  
##  Mean   :388.3   Mean   :394.6   Mean   :398.6   Mean   :404.1  
##  3rd Qu.:510.5   3rd Qu.:516.0   3rd Qu.:524.5   3rd Qu.:530.8  
##  Max.   :595.0   Max.   :612.0   Max.   :618.0   Max.   :628.0

The RATS dataset is in the wide data form: we have 16 subjects (rats) split into three groups, and the weight of each rat is measured repeatedly over the study, roughly once a week. The time points are WD1, WD8, WD15, WD22, WD29, WD36, WD43, WD44, WD50, WD57 and WD64.

Graphical Displays of Longitudinal Data

# convert the ID and Group columns to factors
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

# convert to longitudinal form
RATS_L <- pivot_longer(RATS, cols = -c(ID, Group), names_to = "WD",
                     values_to = "Weight") %>%
  mutate(Time = as.integer(substr(WD, 3,4))) %>%
  arrange(Time)

dim(RATS_L)
## [1] 176   5
glimpse(RATS_L)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …

Now the dataset is in the long (longitudinal) form. In the wide form we had one line per rat, with one column per time point at which the weight was measured. In the long form we instead have one line per rat × time point combination (16 × 11 = 176 rows): each line now holds one rat, one time point and the weight measured at that time point.

# draw the plot
ggplot(RATS_L, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS_L$Weight), max(RATS_L$Weight)))

From this first plot we can already appreciate that the weight of almost all the rats tends to increase during the study. We can also observe the tracking phenomenon: rats that were heavier at the beginning remain heavier throughout the study.
However, the non-standardized nature of the data makes drawing conclusions hard.

# standardise the variable Weight
RATS_L_st <- RATS_L %>%
  group_by(Time) %>%
  mutate(Weight_st = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()
# look at the data
glimpse(RATS_L_st)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ Weight_st <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# plot again with the standardised Weight
ggplot(RATS_L_st, aes(x = Time, y = Weight_st, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

# summary data with mean and standard error of weight by group and time
RATS_L_st_se <- RATS_L_st %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATS_L_st_se)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, …
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, …
## $ mean  <dbl> 250.625, 255.000, …
## $ se    <dbl> 5.38164, 4.62910, …
# plot the mean profiles
ggplot(RATS_L_st_se, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

From this last plot, showing the average profile of each rat group with its mean and standard error at each time point, we can see that there is no overlap between the group means.

Another visualization

# create a summary data by Group and ID with mean as the summary variable
# (ignoring baseline Time 1)
RATS_S <- RATS_L %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
glimpse(RATS_S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, 4…
# draw a boxplot of the mean versus Group
ggplot(RATS_S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), Days 1-64")

The created boxplots reflect quite well what we observed from the previous plots. This plot also shows that the variability of Group 3 is the smallest. All the groups are characterized by the presence of an outlier each: we can try to remove them.

# Create a new data set by filtering out the outliers, then draw the plot again with the new data
RATS_S_clean <- RATS_S[!(RATS_S$Group == 3 & RATS_S$mean < 500),] %>%
  subset(mean < 590) %>% subset(mean > 240)

# plot again
ggplot(RATS_S_clean, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), Days 1-64")

ANOVA

Since a t-test requires exactly two levels (groups) and we have three of them, such a test can't be applied here. Let's move to ANOVA instead.
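Pairwise two-sample t-tests with a correction for multiple testing would be another option; a minimal sketch using the RATS_S summary data created above:

```r
# alternative sketch: all pairwise two-sample t-tests between the three
# groups, with Holm correction for multiple comparisons (base R)
pairwise.t.test(RATS_S$mean, RATS_S$Group, p.adjust.method = "holm")
```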

# Add the baseline (WD1) from the original wide data as a new variable to the summary data
RATS_S <- RATS_S %>%
  mutate(baseline = RATS$WD1)

# Fit the linear model with the mean as the response 
fit1 <- lm(mean ~ baseline + Group, data = RATS_S)
summary(fit1)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATS_S)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.905  -4.194   2.190   7.577  14.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.16375   21.87657   1.516   0.1554    
## baseline     0.92513    0.08572  10.793 1.56e-07 ***
## Group2      34.85753   18.82308   1.852   0.0888 .  
## Group3      23.67526   23.25324   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.68 on 12 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.992 
## F-statistic: 622.1 on 3 and 12 DF,  p-value: 1.989e-13
# Compute the analysis of variance table for the fitted model with anova()
anova(fit1)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the model and at the ANOVA table, when the baseline is taken into account the groups no longer seem to be significantly different. In other words, the baseline explains most of the variability between the groups.

Part II

Description of the BPRS data

First of all, let’s have a look at the data and give a brief description of the dataset.

# Read the BPRS data
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt",
                   sep = " ", header = T)

# Look at the (column) names of BPRS
names(BPRS)
##  [1] "treatment" "subject"   "week0"     "week1"     "week2"     "week3"    
##  [7] "week4"     "week5"     "week6"     "week7"     "week8"
# Look at the structure of BPRS
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...
# Print out summaries of the variables
summary(BPRS)
##    treatment      subject          week0           week1           week2     
##  Min.   :1.0   Min.   : 1.00   Min.   :24.00   Min.   :23.00   Min.   :26.0  
##  1st Qu.:1.0   1st Qu.: 5.75   1st Qu.:38.00   1st Qu.:35.00   1st Qu.:32.0  
##  Median :1.5   Median :10.50   Median :46.00   Median :41.00   Median :38.0  
##  Mean   :1.5   Mean   :10.50   Mean   :48.00   Mean   :46.33   Mean   :41.7  
##  3rd Qu.:2.0   3rd Qu.:15.25   3rd Qu.:58.25   3rd Qu.:54.25   3rd Qu.:49.0  
##  Max.   :2.0   Max.   :20.00   Max.   :78.00   Max.   :95.00   Max.   :75.0  
##      week3           week4           week5           week6           week7     
##  Min.   :24.00   Min.   :20.00   Min.   :20.00   Min.   :19.00   Min.   :18.0  
##  1st Qu.:29.75   1st Qu.:28.00   1st Qu.:26.00   1st Qu.:22.75   1st Qu.:23.0  
##  Median :36.50   Median :34.50   Median :30.50   Median :28.50   Median :30.0  
##  Mean   :39.15   Mean   :36.35   Mean   :32.55   Mean   :31.23   Mean   :32.2  
##  3rd Qu.:44.50   3rd Qu.:43.00   3rd Qu.:38.00   3rd Qu.:37.00   3rd Qu.:38.0  
##  Max.   :76.00   Max.   :66.00   Max.   :64.00   Max.   :64.00   Max.   :62.0  
##      week8      
##  Min.   :20.00  
##  1st Qu.:22.75  
##  Median :28.00  
##  Mean   :31.43  
##  3rd Qu.:35.25  
##  Max.   :75.00

The BPRS dataset is in the wide data form: we have 40 subjects (men) split into two treatment groups, and the brief psychiatric rating scale (bprs) score of each man is measured weekly, from week0 (the baseline measurement) to week8.

Convert to longitudinal form

# convert categorical variables to factors
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
# convert to longitudinal form
BPRS_L <-  pivot_longer(BPRS, cols= -c(treatment,subject), names_to = "weeks",
                       values_to = "bprs") %>% arrange(weeks)
BPRS_L <-  BPRS_L %>% mutate(week = as.integer(substr(weeks,5,5)))

Now the dataset is in the long (longitudinal) form. In the wide form we had one line per man, with one column per weekly measurement. In the long form we instead have one line per man × time point combination (40 × 9 = 360 rows): each line now holds one man, one week and the bprs score measured that week.

Visualise the data

# plot the BPRS_L data
ggplot(BPRS_L, aes(x = week, y = bprs, group = interaction(subject, treatment))) +
  geom_line(aes(linetype = treatment )) + 
  scale_y_continuous(name = "bprs") + 
  theme(legend.position = "top")

Looking at the plot, it looks like there’s no difference in the effect of the two different treatments since the lines from the two groups overlap quite well.

Linear model

# create a linear regression model for the BPRS data
fit_BPRS <- lm(bprs ~ week + treatment, data = BPRS_L)

# print out a summary of the model
summary(fit_BPRS)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS_L)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

From the fitted linear model we can see that time (the week variable) is highly significant in the regression, while treatment2 does not look important. However, the independence assumption of the linear model is violated by the repeated measures, so we need a random intercept model.

The Random Intercept Model

As explained before, the linear model assumes independence of the repeated measures of bprs, which is not the case here: we need to fit a random intercept model with the same two explanatory variables, week and treatment. This allows the linear regression fit of each man to differ in intercept from the other men.
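In formula terms, the random intercept model can be sketched as:

$$
\text{bprs}_{ij} = \beta_0 + u_i + \beta_1\,\text{week}_{ij} + \beta_2\,\text{treatment2}_{i} + \varepsilon_{ij},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
$$

where \(u_i\) is the subject-specific deviation from the common intercept \(\beta_0\).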

library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step
# Create a random intercept model
fit_BPRS_2 <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS_L,
                   REML = FALSE)

# Print the summary of the model
summary(fit_BPRS_2)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS_L
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     1.9090  37.2392  24.334   <2e-16 ***
## week         -2.2704     0.2084 340.0000 -10.896   <2e-16 ***
## treatment2    0.5722     1.0761 340.0000   0.532    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

From the random intercept model we can have a look at the random effects. The subject-derived effect looks important: the between-subject variance is 47.41 against a residual variance of 104.21, so roughly 31% of the unexplained variation is attributable to differences between subjects.
By running this model, which takes into account the randomness coming from the subjects, the standard errors of the fixed effects are corrected and we can better trust those estimates.
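The share of the unexplained variation attributable to subjects (the intraclass correlation) can be computed directly from the variance components reported in the summary above:

```r
# intraclass correlation from the variance components in the summary above
var_subject  <- 47.41    # subject (intercept) variance
var_residual <- 104.21   # residual variance
var_subject / (var_subject + var_residual)  # ≈ 0.313
```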

Slippery slopes: Random Intercept and Random Slope Model

Now we can move on to the random intercept and random slope model. Fitting it allows the linear regression fits of the individuals to differ not only in intercept but also in slope.
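In formula terms, the model now adds a subject-specific slope for week on top of the subject-specific intercept:

$$
\text{bprs}_{ij} = \beta_0 + u_{0i} + (\beta_1 + u_{1i})\,\text{week}_{ij} + \beta_2\,\text{treatment2}_{i} + \varepsilon_{ij}
$$

where \((u_{0i}, u_{1i})\) follow a bivariate normal distribution, so the intercept and slope deviations are allowed to correlate.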

fit_BPRS_3 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS_L,
                   REML = FALSE)

# print a summary of the model
summary(fit_BPRS_3)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS_L
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  46.4539     2.1052  22.6795  22.066  < 2e-16 ***
## week         -2.2704     0.2977  19.9991  -7.626 2.42e-07 ***
## treatment2    0.5722     1.0405 320.0005   0.550    0.583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(fit_BPRS_3, fit_BPRS_2)
## Data: BPRS_L
## Models:
## fit_BPRS_2: bprs ~ week + treatment + (1 | subject)
## fit_BPRS_3: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## fit_BPRS_2    5 2748.7 2768.1 -1369.4   2738.7                       
## fit_BPRS_3    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding week as a random slope improves the model (the ANOVA comparison gives a p-value < 0.05). This means that week is important as a random effect as well.
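The likelihood ratio test behind this comparison can be reproduced by hand from the deviances reported above (2738.7 vs 2731.4, with 7 - 5 = 2 extra parameters):

```r
# likelihood ratio test from the reported deviances (base R)
chisq <- 2738.7 - 2731.4                   # deviance difference, about 7.3
pchisq(chisq, df = 2, lower.tail = FALSE)  # ≈ 0.026, matching anova()
```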

Random Intercept and Random Slope Model with interaction

# create a random intercept and random slope model with the interaction
fit_BPRS_interaction <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS_L,
                   REML = FALSE)

# print a summary of the model
summary(fit_BPRS_interaction)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRS_L
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)      47.8856     2.2521  29.6312  21.262  < 2e-16 ***
## week             -2.6283     0.3589  41.7201  -7.323 5.24e-09 ***
## treatment2       -2.2911     1.9090 319.9977  -1.200   0.2310    
## week:treatment2   0.7158     0.4010 319.9977   1.785   0.0752 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(fit_BPRS_interaction, fit_BPRS_3)
## Data: BPRS_L
## Models:
## fit_BPRS_3: bprs ~ week + treatment + (week | subject)
## fit_BPRS_interaction: bprs ~ week * treatment + (week | subject)
##                      npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## fit_BPRS_3              7 2745.4 2772.6 -1365.7   2731.4                       
## fit_BPRS_interaction    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot the observed BPRS_L data
ggplot(BPRS_L, aes(x = week, y = bprs, group = interaction(subject, treatment))) +
  geom_line(aes(linetype = treatment )) + 
  scale_y_continuous(name = "bprs") + 
  theme(legend.position = "top")

# Create a vector of the fitted values and add to BPRS_L
Fitted <- fitted(fit_BPRS_interaction)
BPRS_L$fitted <- Fitted

# plot the fitted values from the interaction model
ggplot(BPRS_L, aes(x = week, y = fitted, group = interaction(subject, treatment))) +
  geom_line(aes(linetype = treatment )) + 
  scale_y_continuous(name = "bprs") + 
  theme(legend.position = "top")

Finally, the model with the interaction between week and treatment is not significantly better than the previous one (ANOVA p-value > 0.05), so the simpler random intercept and random slope model can be kept.